The data was sent to you before the course.
Please place this file on your desktop and unzip it.
We are going to be working on 2 files. Please open:
1. Beginners_R_Plotting_Practicals.Rmd
2. Beginners_R_Plotting.html
Beginners_R_Plotting.html is an HTML file and should open in your default web browser.
Beginners_R_Plotting_Practicals.Rmd is an R Markdown file and should open in RStudio.
We need to alter how the output is displayed for Beginners_R_Plotting_Practicals.Rmd in RStudio.
We want the output to be displayed in the console NOT within the R markdown file.
To change this:
Click on the Gear Icon > Select "Chunk Output in Console"
Beginners_R_Plotting_Practicals.Rmd has a table of contents, built from the headers that the # symbol creates in Markdown.
To access it, press the dashed-line button in the upper right corner of the markdown file.
A classic example, “Anscombe’s Quartet”, contains four data sets with nearly identical summary statistics.
However, when the data are visualized, the true differences between the data sets are revealed.
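As a rough illustration of the point above, the quartet ships with base R as the `anscombe` data frame (columns `x1` to `x4` and `y1` to `y4`), so the near-identical statistics can be checked directly. The helper name `anscombe_stats` is just a label chosen for this sketch.

```r
# Summary statistics for the four x/y pairs in the built-in anscombe data.
# Each call to sapply() loops over set numbers 1 to 4 and picks the matching column.
anscombe_stats <- data.frame(
  set    = 1:4,
  mean_x = sapply(1:4, function(i) mean(anscombe[[paste0("x", i)]])),
  mean_y = sapply(1:4, function(i) mean(anscombe[[paste0("y", i)]])),
  var_x  = sapply(1:4, function(i) var(anscombe[[paste0("x", i)]])),
  var_y  = sapply(1:4, function(i) var(anscombe[[paste0("y", i)]])),
  cor_xy = sapply(1:4, function(i) {
    cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
  })
)

# Rounded to two decimals, the four rows look almost the same.
print(round(anscombe_stats, 2))
```

Plotting each pair as a scatter plot (for example `x1` against `y1`) is what exposes how different the four sets really are.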
The ggplot2 package implements the ideas of Leland Wilkinson’s book The Grammar of Graphics, published in 1999. It works on the idea of layers: elements can be added to a plot one at a time, each as another layer.
General command format for ggplot2:
library(ggplot2)
ggplot(data = mtcars) +
geom_point(mapping = aes(x=mpg, y=hp))
OR
ggplot(data = mtcars, mapping = aes(x=mpg, y=hp)) +
geom_point()
Both of these syntaxes are fine.
First, we need to load the ggplot2 library. However, if we have already loaded the tidyverse library, then ggplot2 is already loaded.
This is because the tidyverse package is a package of packages: it does not contain its own functions per se but bundles a variety of other packages, such as dplyr, tidyr and ggplot2. Therefore, by installing and loading the tidyverse package, we install and load all of these other packages as well.
Using the diamonds dataset:
REMEMBER basic syntax for a scatter plot in ggplot is this:
ggplot(data = mtcars, mapping = aes(x=mpg, y=hp)) +
geom_point()
It is preferable to start with a clean, tidy dataset.
Getting the data into the correct format for making a plot is sometimes the biggest challenge of data visualization.
Think about what you want the plot to look like (which variables are mapped to which aesthetics), and then work to format the data to achieve that visualization.
There are two different types of reasons to create a plot:
Think about what your visualization is communicating.
In ggplot terminology, an aesthetic is a link between a variable in the data and a visual attribute of the plot.
Attributes are:
Aesthetic links to variables must be placed inside aes()
ggplot(diamonds, aes(x = depth, y = table, color = cut )) +
geom_point()
If an aesthetic is not linked to a variable, it is set outside aes() and applied to every point equally; it is then a fixed attribute
ggplot(diamonds, aes(x = depth, y = table)) +
geom_point(color = "blue")
Not all aesthetics are available for all plot types (geometries) (for instance, y axis for histograms).
Check the documentation for geoms here.
Using the diamonds dataset:
Why do you think this is happening?
The type of plot used to visualize data depends on the type of data being plotted. Not all graphs are appropriate for all data types.
ggplot2 offers a wide variety of different plot types, referred to as geoms.
A histogram or a density plot is the most common choice for a single continuous variable.
To make a histogram in ggplot2, we use the geom_histogram() command.
ggplot(diamonds, aes(x = price)) +
geom_histogram()
See here for more information about options in geom_histogram.
To make a density plot in ggplot2, we use the geom_density() command.
ggplot(diamonds, aes(x = price)) +
geom_density()
See here for more information about options in geom_density.
Using the diamonds dataset:
What is binwidth doing?
Do the plot and its interpretation change as you alter the binwidth?
A bar plot is the most common choice for a single discrete variable.
Do not use bar plots to plot DISTRIBUTIONS. Instead, plot the individual data points, or consider another plot type such as a box plot, a violin plot, or simple error bars. This is a very good blog post about why not to use bar plots for this purpose.
To make a bar chart in ggplot2, we use the geom_bar() command.
ggplot(diamonds, aes(x = cut)) +
geom_bar()
See here for more information about options in geom_bar.
Make a bar chart using the diamonds dataset. Select a non-numeric variable to chart.
To make a scatter plot in ggplot2, we use the geom_point() command.
ggplot(mtcars, aes(x = mpg, y = wt)) +
geom_point()
See here for more information about options in geom_point.
To make a jitter plot in ggplot2, we use the geom_jitter() command.
ggplot(mtcars, aes(x = mpg, y = wt)) +
geom_jitter()
geom_jitter creates a scatter plot but “adds a small amount of random variation to the location of each point, and is a useful way of handling over plotting caused by discreteness in smaller data sets.”
See here for more information about options in geom_jitter.
To make a plot with a smoothing line in ggplot2, we use the geom_smooth() command.
ggplot(mtcars, aes(x = mpg, y = wt)) +
geom_smooth()
geom_smooth adds a smoothing line to the plot, and there are various different statistical models that can be used to create this line.
See here for more information about options in geom_smooth.
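As a small sketch of choosing the model behind the smoothing line: the `method` argument of `geom_smooth()` selects the fit, and `"lm"` requests a straight-line linear model. Layering it over `geom_point()` and turning off the confidence band with `se = FALSE` are choices made for this illustration, not part of the original example.

```r
library(ggplot2)

# Points plus a straight-line fit.
# method = "lm" fits a linear model; formula = y ~ x states the model explicitly;
# se = FALSE hides the shaded confidence band.
p_smooth <- ggplot(mtcars, aes(x = mpg, y = wt)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

p_smooth
```

Leaving `method` unset lets ggplot2 pick a default smoother based on the size of the data.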
To make a scatter plot that uses text labels instead of points in ggplot2, we use the geom_text() command.
ggplot(mtcars, aes(x = mpg, y = wt, label = cyl)) +
geom_text()
The label aesthetic must be specified in geom_text.
See here for more information about options in geom_text.
To make a scatter plot in ggplot2 with a straight diagonal line, we use the geom_abline() command.
ggplot(mtcars, aes(x = mpg, y = wt)) +
geom_point() +
geom_abline(slope = -0.1409 , intercept = 6.0473)
NOTE: We are plotting two geoms on the same plot.
See here for more information about options in geom_abline.
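The slope and intercept in the geom_abline() example above are close to what a least-squares fit of wt on mpg gives. Assuming that is where they came from (the source does not say), a sketch like the following computes them instead of typing them in by hand. The names `fit` and `line_coefs` are chosen for this illustration.

```r
library(ggplot2)

# Fit wt as a linear function of mpg and extract the two coefficients.
fit <- lm(wt ~ mpg, data = mtcars)
line_coefs <- coef(fit)  # named vector with "(Intercept)" and "mpg"

# Draw the fitted line over the points using the computed values.
p_abline <- ggplot(mtcars, aes(x = mpg, y = wt)) +
  geom_point() +
  geom_abline(slope = line_coefs[["mpg"]],
              intercept = line_coefs[["(Intercept)"]])

p_abline
```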
To add a vertical or horizontal line to a scatter plot in ggplot2, we use the geom_vline() (vertical line) or geom_hline() (horizontal line) command.
ggplot(mtcars, aes(x = mpg, y = wt)) +
geom_point() +
geom_vline(xintercept = 30) +
geom_hline(yintercept = 3)
NOTE: We are plotting three geoms on the same plot.
See here for more information about options in geom_hline or geom_vline.
Using the gapminder data:
Using dplyr:
We can filter the data to only include rows where year is 2007. We can do this two ways:
1. Filter gapminder data within the ggplot2 command
2. Filter the gapminder data and save the output as a new data frame, then pass this data frame as the data for the ggplot2 command
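The two approaches can be sketched as follows. This assumes the data come from the gapminder package (with its `year`, `gdpPercap` and `lifeExp` columns); the choice of x and y variables and the object names are only for illustration.

```r
library(ggplot2)
library(dplyr)
library(gapminder)  # assumption: the gapminder data frame comes from this package

# 1. Filter inside the ggplot2 command.
p_inline <- ggplot(filter(gapminder, year == 2007),
                   aes(x = gdpPercap, y = lifeExp)) +
  geom_point()

# 2. Filter first, save the result, then hand the new data frame to ggplot2.
gapminder_2007 <- gapminder %>%
  filter(year == 2007)

p_saved <- ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp)) +
  geom_point()
```

Both plots are built from the same rows; the second form is handy when the filtered data are reused in several plots.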
geom_smooth is creating a line for every continent, why do you think this is?
Think about how inheritance of the aesthetics works.
We want to look at a single line, a global trend rather than continental trends.
Each geom can have its own local aesthetics. If we move color and size from the top line into geom_point as aesthetics, geom_smooth no longer inherits them and will plot a single global trend line.
The code for geom_point should now look like this: geom_point(aes(color = continent, size = pop))
Try to alter the code to create this.
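The inheritance rule described above can be seen on a different data set, so the exercise itself is left to you. This sketch uses the built-in mtcars data with `factor(cyl)` standing in for a grouping variable; those choices are assumptions for the illustration only.

```r
library(ggplot2)

# Global mapping: colour is set in ggplot(), so BOTH geoms inherit it
# and geom_smooth() fits one line per cylinder group.
p_global <- ggplot(mtcars, aes(x = mpg, y = wt, color = factor(cyl))) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Local mapping: colour is set only inside geom_point(), so geom_smooth()
# sees just x and y and fits a single line through all the points.
p_local <- ggplot(mtcars, aes(x = mpg, y = wt)) +
  geom_point(aes(color = factor(cyl))) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
```

Printing `p_global` and then `p_local` shows three trend lines in the first plot and one in the second.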
Overplotting occurs when there are too many data points. It becomes hard to see any information about the density of the data when all points are solid black dots.
Information about the data becomes less clear with more data points.
These plots were created using this code.
library(car)
ggplot(Vocab, aes(x=education, y = vocabulary)) +
geom_point()
ggplot(Vocab, aes(x=education, y = vocabulary)) +
geom_jitter(alpha = 0.2, shape = 1, size = 1)
Overplotting is a huge issue when working with large data sets, but there are a few things that you can do to minimize its effect.
Using the gapminder data:
What do you think the alpha option is doing?
To make a line plot in ggplot2, we use the geom_line() command.
library(tidyverse) # provides %>%, rownames_to_column(), gather() and ggplot2
USPersonalExpenditure <- as.data.frame(USPersonalExpenditure)
# To make this plot I need to alter the format of the data:
# move the row names into an Expense column, then reshape from wide to long
USPersonalExpenditure_long <- USPersonalExpenditure %>%
  rownames_to_column(var = "Expense") %>%
  gather(key = year, value = amount, -Expense)
ggplot(USPersonalExpenditure_long) +
  geom_line(mapping = aes(x = year, y = amount, group = Expense))
See here for more information about options in geom_line.
rownames_to_column converts the row names into a column within a data frame.
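For reference, tidyr marks gather() as superseded by pivot_longer(). A minimal sketch of the same reshaping with the newer function might look like this; the object names `expenditure_wide` and `expenditure_long` are chosen for this illustration.

```r
library(tibble)  # rownames_to_column()
library(tidyr)   # pivot_longer()

# Move the row names (the expense categories) into a real column.
expenditure_wide <- rownames_to_column(
  as.data.frame(USPersonalExpenditure),
  var = "Expense"
)

# Reshape from wide (one column per year) to long (one row per expense and year).
expenditure_long <- pivot_longer(
  expenditure_wide,
  cols = -Expense,
  names_to = "year",
  values_to = "amount"
)

head(expenditure_long)
```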
Use the gapminder data:
To make a box plot in ggplot2, we use the geom_boxplot() command.
ggplot(mtcars, aes(x = factor(cyl), y=wt )) +
geom_boxplot()
See here for more information about options in geom_boxplot.
To make a violin plot in ggplot2, we use the geom_violin() command.
ggplot(mtcars, aes(x = factor(cyl), y=wt )) +
geom_violin()
See here for more information about options in geom_violin.
Using gapminder dataset:
What does a box in the plot indicate?
You can give geom_jitter a range to jitter within, see here. Adding position information to the geom_jitter command alters the spread of the points.
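One way to limit the jitter range is through the `width` and `height` arguments of `geom_jitter()`. The sketch below uses mtcars rather than gapminder, and the values 0.1 and 0 are arbitrary choices for illustration.

```r
library(ggplot2)

# Box plot with the raw points layered on top.
# width = 0.1 keeps the points close to their category on the x-axis;
# height = 0 leaves the y values untouched, so the data are not distorted.
# outlier.shape = NA hides the box plot's own outlier dots to avoid drawing them twice.
p_jitter <- ggplot(mtcars, aes(x = factor(cyl), y = wt)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.1, height = 0)

p_jitter
```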
Which one do you think is more informative, the box plot or violin plot?
Split up a large complex plot into multiple smaller plots by creating a separate plot for each category of a categorical (or discrete) variable.
To make a facet plot in ggplot2, we use the facet_grid() command.
ggplot(diamonds) +
geom_boxplot(mapping = aes(x = cut, y = price)) +
facet_grid(color ~ clarity)
See here for more information about options in facet_grid.
See here for more information about a similar command called facet_wrap.
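As a brief sketch of the facet_wrap alternative mentioned above: where facet_grid lays panels out by two variables (rows ~ columns), facet_wrap takes one variable and wraps its panels into rows as needed. The choice of carat, price and cut here is only for illustration.

```r
library(ggplot2)

# One scatter plot panel per level of cut, wrapped into a grid automatically.
p_wrap <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.2) +
  facet_wrap(~ cut)

p_wrap
```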
Using the gapminder data:
Using the diamonds data set:
Everything in ggplot2 can be altered. Theme refers to all the visual elements that are not part of the data, like:
Absolutely everything can be modified.
There is a hierarchy to the theme elements; a detailed breakdown of how inheritance between elements works is written here.
There are A LOT of built-in themes in ggplot2. A useful starting point is to pick one of them and then make minor changes, rather than tweak everything yourself. This reduces the number of modifications that you have to make.
Use built in themes
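A minimal sketch of that workflow, assuming theme_bw() as the starting theme and two arbitrary tweaks chosen for illustration:

```r
library(ggplot2)

# Start from a built-in theme, then override only the elements you care about.
# theme() must come AFTER theme_bw(), otherwise theme_bw() resets the tweaks.
p_theme <- ggplot(mtcars, aes(x = mpg, y = wt, color = factor(cyl))) +
  geom_point() +
  theme_bw() +
  theme(legend.position = "bottom",
        axis.title = element_text(size = 14))

p_theme
```

Other built-in starting points include theme_minimal() and theme_classic().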
Using the gapminder data:
There are two ways to export plots in R.
The first is to click the Export button in the Plot window.
Ensure you are saving in the directory you want (default should be your set working directory), change the file name, and you can alter the size of the image. It is best to keep the aspect ratio the same as you alter the size.
The second way, and a way to ensure high resolution (most journals want 300dpi for plots), is to open a plotting file and print the plot to that file.
tiff("Figure_1.tiff", height = 10, width = 20, units = 'cm',
compression = "lzw", res = 300)
ggplot(mtcars, aes(x = mpg, y = hp)) +
geom_point()
dev.off()
The syntax is a bit strange.
You first open a file using the tiff() function.
Next you make your plot as usual.
Then you MUST close the plotting file using dev.off().
Using the diamonds dataset:
Make a scatter plot with “carat” on the x-axis and “price” on the y-axis. Color the points by “clarity”. Save the plot as a tiff called “Price_by_Carat.tiff”, with a height of 10, a width of 8, a res of 200 and lzw compression.
Ensure this is being saved in the BeginnersR_Materials folder on your Desktop.
Humans read different information with differing levels of ease, and this should be taken into account when creating a graphic. Make it easy for your reader to be able to interpret the information.
It is also worth making graphics color blind friendly. There are numerous palettes within R/ggplot2 that have been created to help. See here for more information.
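As one possible example, ggplot2 includes the viridis colour scales, which were designed to remain distinguishable under common forms of colour blindness. The sketch below applies the discrete version to a grouping variable; the data set and variables are chosen for illustration.

```r
library(ggplot2)

# scale_color_viridis_d() replaces the default discrete colours
# with the viridis palette for the colour aesthetic.
p_cb <- ggplot(mtcars, aes(x = mpg, y = wt, color = factor(cyl))) +
  geom_point(size = 3) +
  scale_color_viridis_d()

p_cb
```

For a continuous variable mapped to colour, scale_color_viridis_c() is the matching scale.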
In general, color should only be used for non-continuous traits. Color differentiation of saturation and value is difficult, and our perception of them is skewed by the colors around them. This makes it harder to accurately compare across graphs whether color is the same or different.
Avoid using two different aesthetics to show the same information.
What is color telling us?